This project uses the TMDb movie dataset, which contains about 10,000 movies collected from TMDb. Each movie includes information on its user rating, revenue, cast, genres, etc. The goal of this investigation is to extract insights about movie features that could help the movie industry earn more profit.
The project follows these steps:
Assess the Data: Download the file and load it into the notebook.
Data Wrangling: Remove unnecessary information such as empty and duplicate values. The dataframe is also cleaned up so that the results can be summarized.
Exploratory Data Analysis: Study the statistics and build visualizations to analyze the relationships between variables.
Communicate the Results: Use the above findings to answer the questions, then draw conclusions.
Questions to be answered:
Which movies are the most profitable?
Which movies have the highest and lowest profit, budget and runtime?
How does popularity affect profit?
In which years did movies make the most profit?
What are the top casts, directors and genres?
Which months have higher movie profits?
Importing Libraries
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import matplotlib.animation as animation
import seaborn as sns
from IPython.display import HTML
Load Data frame
df = pd.read_csv('tmdb-movies.csv')
In this section we gather the required data, check which parts of it are useful, and get a first look at the data and its overall structure.
Checking the file for corruption:
df.head(3)
As the column names and indexes are not jumbled or misaligned, we can now concentrate on cleaning the data and gathering useful information from it.
df.shape
We'll be analyzing a dataframe with 21 columns and 10866 entries. We notice that the columns “cast”, “genres” and “production_companies” contain multiple values separated by the '|' character.
df.info();
df.describe()
There are several columns with missing values. For example, "homepage" has only 2936 values, and "tagline" has 8042 values. These columns, along with a few others, are not needed for the analysis and can be dropped. Now, drop all unnecessary columns.
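A quick way to see which columns are incomplete is to count the missing values per column. The sketch below runs on a small made-up frame (the name toy and its values are illustrative only, not taken from the dataset); the same two lines could be applied to df.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the movie data (values are made up for illustration)
toy = pd.DataFrame({
    'original_title': ['A', 'B', 'C', 'D'],
    'homepage': ['http://a.example', np.nan, np.nan, np.nan],
    'tagline': ['x', 'y', np.nan, 'z'],
})

# Number of missing values per column, most incomplete column first
missing = toy.isnull().sum().sort_values(ascending=False)
# Share of rows that are missing in each column
missing_share = (missing / len(toy)).round(2)

print(pd.DataFrame({'missing': missing, 'share': missing_share}))
```

Columns with a large share of missing values and no role in the questions are natural candidates for dropping.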
In this section we clean the data by dropping the columns that are not required and converting data types where needed for easier evaluation.
df.drop(['id', 'imdb_id', 'homepage', 'tagline', 'keywords', 'overview'], axis=1, inplace=True)
df.shape
The number of columns has been reduced from 21 to 15 after dropping the columns that are not required.
The next step is to change the data type of "release_date".
We notice that the dtype of "release_date" is object, which means the date is stored as a string. It's better to convert it to the dedicated datetime format.
df.release_date = pd.to_datetime(df['release_date'])
#check the data types now
df.dtypes
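One caveat is worth checking after the conversion. If the file stores dates with two-digit years (an assumption here, for example '11/15/66'), the parser maps the years 00 to 68 into the 2000s. The sketch below uses a made-up toy frame and a release_year column to shift such dates back by a century; the names toy, parsed and fixed are illustrative only.

```python
import pandas as pd

# Toy data: two-digit years in the date string, full year in a separate column
toy = pd.DataFrame({
    'release_date': ['6/9/15', '11/15/66'],
    'release_year': [2015, 1966],
})

# %y follows the strptime convention: 00-68 -> 2000-2068, 69-99 -> 1969-1999
parsed = pd.to_datetime(toy['release_date'], format='%m/%d/%y')

# Rows where the parsed year lies after the known release year are off by a century
wrong_century = parsed.dt.year > toy['release_year']
fixed = parsed.where(~wrong_century, parsed - pd.DateOffset(years=100))

print(pd.DataFrame({'parsed': parsed, 'fixed': fixed}))
```

The month and day are unaffected either way, so an analysis by release month does not depend on this correction.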
To get more insight, more descriptive statistics need to be generated.
Generate descriptive Statistics
# Learn more about data
df.describe()
It can be clearly seen that:
Replace Zero Values with "NaN"
#For budget column
df['budget']=df['budget'].replace(0, np.nan)
#For budget_adj column
df['budget_adj']=df['budget_adj'].replace(0, np.nan)
#For revenue column
df['revenue']=df['revenue'].replace(0, np.nan)
#For revenue_adj column
df['revenue_adj']=df['revenue_adj'].replace(0, np.nan)
#For runtime column
df['runtime']=df['runtime'].replace(0, np.nan)
# Check for changes
df.describe()
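The replacements above can also be written in one step by passing a list of columns. The sketch below uses a made-up toy frame to show the idea and why it matters: once zeros become NaN, pandas skips them when computing statistics such as the mean. The names toy, cols and cleaned are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data where 0 stands for "unknown" rather than a real value
toy = pd.DataFrame({
    'budget': [0, 100, 300],
    'revenue': [50, 0, 250],
    'runtime': [90, 0, 110],
})

cols = ['budget', 'revenue', 'runtime']
cleaned = toy.copy()
# Replace zeros with NaN in all listed columns at once
cleaned[cols] = cleaned[cols].replace(0, np.nan)

print(toy['budget'].mean())      # zeros pull the mean down
print(cleaned['budget'].mean())  # NaN values are skipped
```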
Plot histograms of the numeric columns to get more insight into the data and its distributions.
df.hist(figsize=(12,10));
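#numeric_only=True skips text and date columns, which newer pandas versions refuse to average
df.mean(numeric_only=True)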
More points to note:
Histograms labeled “budget”, “revenue”, "popularity", "vote_count" are extremely right-skewed.
The max values of these columns stand out from all of the other numbers. For example, the mean value of "popularity" is around 0.64, the standard deviation is around 1.0, and 75% of the values are lower than 1.0, but the max value is almost 33!
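Right skew can also be checked numerically: the mean sits well above the median and the skewness is positive. The values below are made up to mimic a column with one extreme maximum; they are not taken from the dataset.

```python
import pandas as pd

# Made-up values: most are small, one is an extreme outlier
popularity = pd.Series([0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 1.0, 33.0])

summary = {
    'mean': popularity.mean(),      # pulled up by the outlier
    'median': popularity.median(),  # robust to the outlier
    'skew': popularity.skew(),      # positive for a right-skewed distribution
}
print(summary)
```

For columns like these, the median is often a more representative "typical" value than the mean.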
df.info();
It makes no sense to drop the rows that contain null values; we can use all the data to answer the questions.
Previously we mentioned that “genres” column contains multiple values separated by the '|' character. So we have to split them in order to create a list of all genres.
Splitting concatenated values:
genres = df.genres.str.split('|', expand=True).stack().value_counts().index
print("Number of genres is {}".format(genres.size))
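To make the splitting step concrete, here is a minimal sketch on a made-up series. It uses explode as an alternative to expand=True followed by stack; both approaches are assumed to give the same counts. The name genres_col is illustrative only.

```python
import pandas as pd

# Made-up genre strings, including a missing value
genres_col = pd.Series(['Action|Adventure', 'Drama', None, 'Action|Drama|Comedy'])

# Split each string into a list, put every list item on its own row, then count
counts = genres_col.str.split('|').explode().value_counts()

print(counts)
print("Number of genres is {}".format(counts.index.size))
```

Missing values stay NaN through the split and are ignored by value_counts.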
We have 20 genres overall. Let's create a color map for them, so that every genre will have a unique color. Choosing colors is a very complicated task, so we’ll use the built-in matplotlib “tab20” colormap that has exactly 20 colors with a good-looking palette.
Creating color map for visualization
colors_map = {}
#plt.get_cmap is used because plt.cm.get_cmap is not available in newer matplotlib versions
cm = plt.get_cmap('tab20')
#we have 20 colors in [0-1] range
#so start from 0.025 and add 0.05 every cycle
#this way we get different colors for
#every genres
off = 0.025
for genre in genres:
    colors_map[genre] = cm(off)
    off += 0.05
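An alternative to stepping through the [0-1] range is to index the colormap with integers, which picks the i-th color of the palette directly. This is a minimal sketch with three made-up genre names; sample_genres and colors_by_index are illustrative names.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so the sketch also runs without a display
from matplotlib import pyplot as plt

sample_genres = ['Drama', 'Comedy', 'Action']

cmap = plt.get_cmap('tab20')
# Calling the colormap with an integer returns the color at that position as an RGBA tuple
colors_by_index = {genre: cmap(i) for i, genre in enumerate(sample_genres)}

for genre, rgba in colors_by_index.items():
    print(genre, rgba)
```

With at most 20 categories, every genre gets its own distinct color, and there is no floating-point offset to keep track of.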
Let's create a function that returns a sorted dataframe showing how a single-value column depends on the values of a multi-value column. This will help us analyze all multi-value columns.
def get_mdepend(df, multival_col, qual_col):
    #split column by '|' character and stack
    split_stack = df[multival_col].str.split('|', expand=True).stack()
    #convert series to frame
    split_frame = split_stack.to_frame(name=multival_col)
    #drop unneeded index
    split_frame.index = split_frame.index.droplevel(1)
    #add qual_col, group and find average
    dep = split_frame.join(df[qual_col]).groupby(multival_col).mean()
    #return sorted dependency
    return dep.sort_values(qual_col)
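To see what get_mdepend computes, here is a minimal sketch on a made-up three-row frame. It uses explode, which is an alternative to the split/stack/join approach and not the function itself; the names toy, exploded and dep_toy are illustrative only.

```python
import pandas as pd

# Made-up data: each movie has one or more genres and one popularity value
toy = pd.DataFrame({
    'genres': ['Action|Drama', 'Drama', 'Action|Comedy'],
    'popularity': [4.0, 2.0, 1.0],
})

# One row per (movie, genre) pair; the popularity value is repeated for each genre
exploded = toy.assign(genres=toy['genres'].str.split('|')).explode('genres')

# Average popularity per genre, sorted from lowest to highest
dep_toy = exploded.groupby('genres')[['popularity']].mean().sort_values('popularity')
print(dep_toy)
```

A movie with several genres contributes its value to every one of them, so the per-genre averages are not independent of each other.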
Next step: Create a function that plots our horizontal bar chart with the popularity of movies for all genres up to the desired year.
def draw_barchart(current_year):
    #get data only up to current_year
    dep = get_mdepend(df.query('release_year <= {}'.format(current_year)),
                      'genres', 'popularity')
    #clear before draw
    ax.clear()
    #plot horizontal barchart using our colormap
    ax.barh(dep.index,
            dep['popularity'].tolist(),
            color=[colors_map[x] for x in dep.index])
    #plot genres and values
    #small label offset, taken from the popularity column so that dx is a number
    dx = dep['popularity'].max() / 200
    for i, (value,
            name) in enumerate(zip(dep['popularity'].tolist(), dep.index)):
        #genre name
        ax.text(value - dx,
                i,
                name,
                size=14,
                weight=600,
                ha='right',
                va='center')
        #genre value
        ax.text(value + dx,
                i,
                f'{value:,.2f}',
                size=14,
                ha='left',
                va='center')
    #big current year
    ax.text(1,
            0.2,
            current_year,
            transform=ax.transAxes,
            color='#777777',
            size=46,
            ha='right',
            weight=800)
    #plot caption of ticks
    ax.text(0,
            1.065,
            'Popularity',
            transform=ax.transAxes,
            size=14,
            color='#777777')
    ax.xaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.1f}'))
    ax.xaxis.set_ticks_position('top')
    ax.tick_params(axis='x', colors='#777777', labelsize=12)
    ax.set_yticks([])
    ax.margins(0, 0.01)
    ax.grid(which='major', axis='x', linestyle='-')
    ax.set_axisbelow(True)
    #chart caption
    ax.text(0,
            1.16,
            'Popularity of movie genres from 1960 to 2015',
            transform=ax.transAxes,
            size=24,
            weight=600,
            ha='left',
            va='top')
Create the bar chart and show it
#create figure
fig, ax = plt.subplots(figsize=(10, 7))
#remove borders
plt.box(False)
#immediately close it to not provide additional figure
#after animation block
plt.close()
animator = animation.FuncAnimation(fig,
draw_barchart,
frames=range(1960, 2016),
interval=666)
#add space before animation
print('')
HTML(animator.to_jshtml())
The visuals clearly answer our question about the most popular genres from year to year; it seems like history is repeating itself.
Data Preparation
dfp = df[df.budget_adj.notnull() & df.revenue_adj.notnull()].copy()
dfp['profit'] = dfp.revenue_adj - dfp.budget_adj
Correlation
sns.pairplot(data=dfp,
x_vars=['popularity', 'budget', 'runtime'],
y_vars=['profit'],
kind='reg');
sns.pairplot(data=dfp,
x_vars=['vote_count', 'vote_average', 'release_year'],
y_vars=['profit'],
kind='reg');
popularity and vote_count have a positive correlation with profit. This is plausible: the more people who watch a movie, the more revenue it gets.
budget has a small positive correlation with profit. This suggests that higher investments in movies tend to go together with higher returns, although a correlation alone does not establish a cause.
Surprisingly, vote_average has only a weak positive correlation with profit.
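The regression plots show the direction of each relationship; the strength can be put into numbers with Pearson correlation coefficients. The sketch below uses a made-up toy frame (the values are not from the dataset), and the same call could be applied to the numeric columns of dfp.

```python
import pandas as pd

# Made-up values, chosen so that popularity tracks profit closely and vote_average does not
toy = pd.DataFrame({
    'popularity': [0.5, 1.0, 2.0, 4.0, 8.0],
    'vote_average': [6.0, 5.0, 7.0, 6.5, 6.0],
    'profit': [1e6, 5e6, 2e7, 6e7, 1.5e8],
})

# Pearson correlation of every column with profit, strongest first
corr_with_profit = toy.corr()['profit'].drop('profit').sort_values(ascending=False)
print(corr_with_profit)
```

A coefficient near 1 means a strong positive linear relationship and a value near 0 means a weak one. Neither says anything about cause and effect.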
Let's configure the default parameters for our plots, such as figure size and font sizes.
params = {
'legend.fontsize': 'x-large',
'figure.figsize': (10, 8),
'axes.labelsize': 'x-large',
'axes.titlesize': 'xx-large',
'xtick.labelsize': 'x-large',
'ytick.labelsize': 'x-large'
}
plt.rcParams.update(params)
Implementing a function to plot a profit chart
def profit_chart(df, title, type_of):
    #create figure
    ax = df.plot(kind='barh')
    #remove legend from plot
    ax.get_legend().remove()
    #set custom axis formatter for millions
    ax.xaxis.set_major_formatter(
        ticker.FuncFormatter(lambda x, p: format(int(x / 1e6), ',')))
    #set titles and labels
    plt.title(title)
    plt.xlabel('Average Profit ($ million)')
    if type_of == 'genre':
        plt.ylabel('Genres')
    #accept both labels, since the call for production companies passes 'Production House'
    elif type_of in ('Company', 'Production House'):
        plt.ylabel('Production House')
    else:
        plt.ylabel('Actors')
Relation of profit with genre.
profit_chart(get_mdepend(dfp, 'genres', 'profit'), 'Movie Profits By Genre','genre')
Find the most profitable production house:
profit_chart(
    #last 15 values
    get_mdepend(dfp, 'production_companies', 'profit').tail(15),
    'Top 15 Profitable Production Companies', 'Production House')
As can be seen, Hoya Productions is the most profitable on average.
profit_chart(
get_mdepend(dfp, 'cast', 'profit').tail(10), 'Top 10 Profitable Actors', 'Actors')
Next is an analysis of the data by release year, to find out whether movies have become more profitable over time.
year_profit_mean = dfp.groupby('release_year')['profit'].mean()
#rolling average over 10-years
year_profit_rol = year_profit_mean.rolling(window=10).mean()
ax = year_profit_rol.plot(figsize=(12, 8))
#set custom axis formatter
ax.yaxis.set_major_formatter(
ticker.FuncFormatter(lambda y, p: format(int(y / 1e6), ',')))
#set titles and labels
plt.title('Movie Profits By Year')
plt.ylabel('Average Profit ($ million)')
plt.xlabel("Release Year");
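A note on the rolling average: with window=10 the first nine years have no value, so the plotted line starts later than the data. The sketch below shows the effect on a made-up series with a window of 3, and how min_periods (a standard parameter of rolling) changes it. The names yearly, strict and relaxed are illustrative only.

```python
import pandas as pd

# Made-up yearly averages
yearly = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], index=range(1960, 1965))

# Default: a value is produced only once the window is full
strict = yearly.rolling(window=3).mean()
# min_periods=1: average over whatever is available so far
relaxed = yearly.rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({'yearly': yearly, 'strict': strict, 'relaxed': relaxed}))
```

The relaxed version covers every year, at the cost of less smoothing at the start of the series.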
Let's make a comparison to determine whether the number of movies released per year has any effect.
#we know that popularity column has no nulls
#so use it for counting the number of movies
ax = df.groupby('release_year')['popularity'].count().plot(figsize=(12, 8))
#set titles and labels
plt.title('Number Of Movies Per Year')
plt.ylabel('Number Of Movies')
plt.legend()
plt.xlabel("Release Year");
The trend in average profit per movie appears to be connected to the fact that the number of movies released every year increased over the same period.
#create a new dataframe that holds only the movies with a profit of $50M or more
profit_movie_data = dfp[dfp['profit'] >= 50000000]
#reindex the new dataframe
profit_movie_data.index = range(len(profit_movie_data))
#start the index at 1 instead of 0
profit_movie_data.index = profit_movie_data.index + 1
#several questions share the same logic and code, so a helper function makes our life easier
#it takes the name of a multi-value column and counts how often each value occurs
def extract_data(column_name):
    #join all rows of the column into one string separated by '|'
    all_data = profit_movie_data[column_name].str.cat(sep = '|')
    #split the string and store the values in a pandas series
    all_data = pd.Series(all_data.split('|'))
    #count the values, in descending order
    count = all_data.value_counts(ascending = False)
    return count
genre_count = extract_data('genres')
genre_count.head()
#we want the bars ordered from largest at the top to smallest at the bottom
#barh plots from bottom to top, so a descending series would appear in ascending order from top to bottom
#hence sort the series in ascending order
genre_count.sort_values(ascending = True, inplace = True)
#initializing plot
ax = genre_count.plot.barh(color = '#007482', fontsize = 15)
#giving a title
ax.set(title = 'The Most filmed genres')
#x-label
ax.set_xlabel('Number of Movies', color = 'g', fontsize = '18')
#giving the figure size(width, height)
ax.figure.set_size_inches(12, 10)
#showing the plot
plt.show()
Another interesting result. Action, Drama and Comedy are the most frequent genres in the visualization, but Comedy takes the prize: about 492 of the movies that made $50M+ in profit have the Comedy genre. `Adventure` and `Thriller` also play a significant role. These five genres have more movies than the rest of the genres, as the visualization shows. This suggests that the probability of earning more than \$50M is higher for these genres, although other genres count too, and the outcome depends on many other influential factors. Western, War, History, Music, Documentary and, least of all, Foreign are less likely to make this much profit than the other genres.
This also doesn't prove that a movie with the Action, Comedy or Drama genre is guaranteed to make more than $50M, but such a movie is likely to attract significant interest from the audience.
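A count of successful movies per genre is not the same as the probability of success, because a genre with many movies will also have many hits. The sketch below separates the two on a made-up toy frame with one genre per row (multi-value genres would need to be split first). The names toy, hit and stats are illustrative only.

```python
import pandas as pd

# Made-up data: Comedy has more movies overall, Western has fewer
toy = pd.DataFrame({
    'genres': ['Comedy', 'Comedy', 'Comedy', 'Comedy', 'Western', 'Western'],
    'profit': [60e6, 10e6, 5e6, 70e6, 80e6, 90e6],
})

# Flag the movies that reach the $50M profit threshold
toy['hit'] = toy['profit'] >= 50_000_000

# hits = number of movies above the threshold, share = fraction of the genre above it
stats = toy.groupby('genres')['hit'].agg(hits='sum', share='mean')
print(stats)
```

In this toy example both genres have the same number of hits, but the share differs, which is why a share per genre is the better estimate of the chance of success.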
Let's find one more key characteristic of these movies.
#to answer this question we group the movies by release month and then calculate the profit for each month
#create a new dataframe that uses 'release_date' as the index
#the $50M+ subset is used here, because the discussion refers to these movies
index_release_date = profit_movie_data.set_index('release_date')
#group the data by month; since the release date is the index, we extract the month from it
groupby_index = index_release_date.groupby([(index_release_date.index.month)])
#this gives us how many movies were released in each month
monthly_movie_count = groupby_index['profit'].count()
#convert the series to a dataframe
monthly_movie_count= pd.DataFrame(monthly_movie_count)
#list of month names
month_list = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
monthly_movie_count_bar = sns.barplot(x = monthly_movie_count.index, y = monthly_movie_count['profit'], data = monthly_movie_count)
#setting size of the graph
monthly_movie_count_bar.figure.set_size_inches(15,8)
#setting the title and customizing
monthly_movie_count_bar.axes.set_title('Number of Movies released in each month', color="r", fontsize = 25, alpha = 0.6)
#setting x-label
monthly_movie_count_bar.set_xlabel("Months", color="g", fontsize = 25)
#setting y-label
monthly_movie_count_bar.set_ylabel("Number of Movies", color="y", fontsize = 35)
#customizing axes values
monthly_movie_count_bar.tick_params(labelsize = 15, labelcolor="black")
#rotating the x-axis values to make it readable
monthly_movie_count_bar.set_xticklabels(month_list, rotation = 30, size = 18)
#shows the plot
plt.show()
#the second part of this question
#the data is already grouped by month, so we sum the 'profit' values for each month and save the result in a new variable
monthly_profit = groupby_index['profit'].sum()
#converting table to a dataframe
monthly_profit = pd.DataFrame(monthly_profit)
#use a seaborn bar plot to visualize the data
monthly_profit_bar = sns.barplot(x = monthly_profit.index, y = monthly_profit['profit'], data = monthly_profit)
#setting size of the graph
monthly_profit_bar.figure.set_size_inches(15,8)
#setting the title and customizing
monthly_profit_bar.axes.set_title('Profits made by movies at their released months', color="r", fontsize = 25, alpha = 0.6)
#setting x-label
monthly_profit_bar.set_xlabel("Months", color="g", fontsize = 25)
#setting y-label
monthly_profit_bar.set_ylabel("Profits", color="y", fontsize = 35)
#customizing axes values
monthly_profit_bar.tick_params(labelsize = 15, labelcolor="black")
#rotating the x-axis values to make it readable
monthly_profit_bar.set_xticklabels(month_list, rotation = 30, size = 18)
#shows the plot
plt.show()
Looking at both visualizations, we see a similar trend: where more movies are released there is more profit, and vice versa, with one exception, December. December is the month with the most releases, but it ranks second in profit. This means that December has a high release rate but a lower profit per movie. June, with around 165 releases (the second highest), ranks highest in profit.
One more thing: we found earlier that the movie with the highest profit in our dataset is 'Avatar', which was released in December. The movie with the biggest loss was also released in December, but it is not counted here. December has both the highest release rate and the most profitable movie, yet it falls short of June in total profit. This suggests that June had movies with significantly high profits while December's movies were less profitable on average, which leaves December behind despite its advantage in the number of releases.
This visualization doesn't prove that releasing a movie in those months will earn more than $50M. It only suggests that the chances are higher; again, it depends on other influential factors, such as directors, story, cast, etc.
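The comparison between release count and total profit can be made explicit by putting count, sum and mean side by side for each month. The sketch below uses made-up dates and profits (not from the dataset) in which one month has more releases but a lower total; toy and by_month are illustrative names.

```python
import pandas as pd

# Made-up data: two June releases and three December releases
toy = pd.DataFrame({
    'release_date': pd.to_datetime(
        ['2010-06-01', '2011-06-15', '2009-12-10', '2012-12-20', '2013-12-25']),
    'profit': [300.0, 200.0, 250.0, 100.0, 90.0],
})

# Group by calendar month and compute count, total and average profit in one pass
by_month = toy.groupby(toy['release_date'].dt.month)['profit'].agg(['count', 'sum', 'mean'])
print(by_month)
```

Reading the mean column next to the count column shows directly whether a month earns more because it has more releases or because its movies are more profitable on average.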
Finally, we can summarize our findings and answer questions from the Introduction section. By looking at animated bar charts of genres and popularity, we can watch how movie trends changed over the years.
Animation and Adventure were the most popular genres and they competed with each other for the whole period of time. The least popular genres were Documentary and Foreign.
TV Movie was popular before 1970 but then rapidly became extremely unpopular.
Science Fiction was extremely unpopular before 1970 but then rapidly became one of the most popular genres.
Fantasy gradually increased in popularity and became very popular recently.
Properties of highly profitable movies:
Limitations:
There is no 100 percent guarantee that this formula will work and that we will earn more than 50M USD! But it shows that we have a high probability of making high profits with a movie that has similar characteristics.
All these directors, actors, genres and release dates have a common pull of attraction. Releasing a movie with these characteristics gives people high expectations, which attracts more people to the movie, but ultimately it comes down mainly to the story and to other important influential factors. Higher expectations are also harder to meet. Even if the movie is good, people's high expectations could lead to biased results, ultimately affecting the profits. We see this in real life, especially with movie sequels. This is just one example of an influential factor that could lead to different results; there are many more that have to be taken into account.
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])